
🌱 fix(e2e): wait for leader election #1676

Conversation

@camilamacedo86 (Contributor) commented Jan 31, 2025

TestClusterExtensionAfterOLMUpgrade was failing due to increased leader election timeouts, causing reconciliation checks to run before leadership was acquired.

This fix ensures the test explicitly waits for leader election logs ("successfully acquired lease") before verifying reconciliation.

Example: https://github.com/operator-framework/operator-controller/actions/runs/13047935813/job/36401741998

Logs from operator-controller:

I0130 08:12:51.290826       1 main.go:427] "starting manager" logger="setup"
I0130 08:12:51.291050       1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"
I0130 08:12:51.291109       1 server.go:83] "starting server" name="health probe" addr="[::]:8081"
I0130 08:12:51.291121       1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8443" secure=true
I0130 08:12:51.291209       1 leaderelection.go:257] attempting to acquire leader lease olmv1-system/9c4404e7.operatorframework.io...


@camilamacedo86 requested a review from a team as a code owner, January 31, 2025 03:42

netlify bot commented Jan 31, 2025

Deploy Preview for olmv1 ready!

Latest commit: 25ffe30
Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/679cdc61ed84a200085c3178
Deploy Preview: https://deploy-preview-1676--olmv1.netlify.app

@camilamacedo86 force-pushed the fix-test-leader-election branch 4 times, most recently from a96e69f to 6b04b01, January 31, 2025 04:09

codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.48%. Comparing base (e77c53c) to head (25ffe30).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1676      +/-   ##
==========================================
- Coverage   67.50%   67.48%   -0.03%     
==========================================
  Files          57       57              
  Lines        4632     4632              
==========================================
- Hits         3127     3126       -1     
- Misses       1278     1279       +1     
  Partials      227      227              
Flag Coverage Δ
e2e 53.40% <ø> (ø)
unit 54.27% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.


@camilamacedo86 force-pushed the fix-test-leader-election branch 3 times, most recently from 0ee61e2 to 4238c2b, January 31, 2025 05:07
@@ -40,6 +40,20 @@ func TestClusterExtensionAfterOLMUpgrade(t *testing.T) {
t.Log("Wait for operator-controller deployment to be ready")
managerPod := waitForDeployment(t, ctx, "operator-controller-controller-manager")

t.Log("Start measuring leader election time")
@perdasilva (Contributor) commented Jan 31, 2025

I think we need to be careful about how we measure the timing here. What we are measuring right now is the amount of time between:

  • the test detecting that the operator-controller deployment rollout has finished, and
  • watchPodLogsForSubstring(leaderElectionCtx, managerPod, "manager", leaderSubstrings...) returning.

This may correlate with the time taken for leader election, but it won't necessarily. E.g. if I upgrade the deployments, go out for lunch for an hour, then come back and run the post-upgrade test, the measurement mostly reflects my lunch break.

Maybe it would be better to extract the timestamps from the first log line and the leader-election log line instead?
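
For illustration, a minimal standalone sketch of that idea (not code from this PR): parse the klog header timestamp from the "starting manager" line and from the "successfully acquired lease" line, then take the difference. The klogTimestamp helper and the second sample line below are made up for the example.

package main

import (
	"fmt"
	"time"
)

// klogTimestamp extracts the timestamp from the header of a klog-formatted
// line such as `I0130 08:12:51.290826 ...`. The header carries no year, so
// only differences between lines from the same run are meaningful.
func klogTimestamp(line string) (time.Time, error) {
	if len(line) < 21 {
		return time.Time{}, fmt.Errorf("line too short: %q", line)
	}
	// Skip the leading severity letter, then parse "mmdd hh:mm:ss.uuuuuu".
	return time.Parse("0102 15:04:05.000000", line[1:21])
}

func main() {
	// The first line is taken from the logs above; the second is a made-up
	// example of the leader-election message.
	started := `I0130 08:12:51.290826       1 main.go:427] "starting manager" logger="setup"`
	acquired := `I0130 08:13:05.123456       1 leaderelection.go:260] successfully acquired lease olmv1-system/9c4404e7.operatorframework.io`

	t0, err := klogTimestamp(started)
	if err != nil {
		panic(err)
	}
	t1, err := klogTimestamp(acquired)
	if err != nil {
		panic(err)
	}
	fmt.Printf("leader election took %s\n", t1.Sub(t0))
}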

@camilamacedo86 (Contributor, Author) replied:

Your comment makes 100% sense.

To keep things simple and focused on the goal of this PR, I've removed the measurement aspect. Whether we want to include it as info or debug output, or decide on a specific measurement approach, is a separate discussion. For now, let's stay within the scope of this change: fixing the test flake and unblocking progress.

@camilamacedo86 force-pushed the fix-test-leader-election branch from 4238c2b to 25ffe30, January 31, 2025 14:21
@camilamacedo86 changed the title from "🌱 fix(e2e): wait for leader election & measure timing for better monitoring" to "🌱 fix(e2e): wait for leader election", Jan 31, 2025
@perdasilva (Contributor) left a comment:

lgtm ^^

t.Log("Wait for acquired leader election")
// The average case is under 1 minute, but in the worst case (the previous leader
// crashed) we could need LeaseDuration (137s) + RetryPeriod (26s) ≈ 163s.
leaderCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute)
A member commented:

I am assuming 3 minutes is the worst-case scenario. I am not familiar with context.WithTimeout; does it return if we acquire the lease before 163s?

A contributor replied:

context.WithTimeout just gives you a context that times out (gets cancelled) after the timeout period.
This means the call to watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...) will return with an error after 3 minutes if it hasn't returned already.
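
A tiny standalone illustration of those semantics (not code from this PR): the timeout only bounds the wait, and a caller that finishes earlier returns before the context is ever cancelled.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitFor stands in for the log watcher: it returns as soon as work completes,
// or with ctx.Err() if the context times out first.
func waitFor(ctx context.Context, work <-chan struct{}) error {
	select {
	case <-work:
		return nil // finished before the deadline
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the timeout elapses
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
	defer cancel()

	work := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond) // e.g. the lease is acquired quickly
		close(work)
	}()

	err := waitFor(ctx, work)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded)) // prints: timed out: false
}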

The member replied:

ok, looks like it is a straightforward timeout method.

defer leaderCancel()

leaderSubstrings := []string{"successfully acquired lease"}
leaderElected, err := watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...)
A contributor commented:

Scraping the logs seems brittle.

Would it be better to use a Watch on the leader-election Lease? We could use the Leases client from CoordinationV1Client in "k8s.io/client-go/kubernetes/typed/coordination/v1".
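
A rough sketch of that idea (not part of this PR): watch the leader-election Lease directly and wait for a holder to appear. Assumptions: the clientset wiring, the hypothetical waitForLeaderLease helper, and the lease namespace/name taken from the logs above (olmv1-system/9c4404e7.operatorframework.io); the typed Coordination client is reached via the clientset's CoordinationV1() accessor.

package e2e

import (
	"context"
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForLeaderLease blocks until the leader-election Lease has a holder,
// or until ctx is cancelled. It returns the holder identity.
func waitForLeaderLease(ctx context.Context, cs kubernetes.Interface) (string, error) {
	w, err := cs.CoordinationV1().Leases("olmv1-system").Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=9c4404e7.operatorframework.io",
	})
	if err != nil {
		return "", err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				return "", fmt.Errorf("watch closed before a leader was elected")
			}
			lease, isLease := ev.Object.(*coordinationv1.Lease)
			if isLease && lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != "" {
				return *lease.Spec.HolderIdentity, nil
			}
		}
	}
}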

@bentito (Contributor) commented Jan 31, 2025:

I realize it's also longer and more code, but the upside is that it reacts right away, like watching the pod log does, without depending on log strings that could change at some point and break our tests outside of our control.

A contributor replied:

that's a good idea! If this work is blocking CI, I'd say merge it as it is, then follow up with the watch ^^

@camilamacedo86 (Contributor, Author) replied:

Yes, I agree that we could do something fancier, but we already check the logs in many places, including right below. We can look at improving this afterwards; there is no reason for us to keep facing the pain of the flake.
